Keywords:

  • link prediction

Abstract

  1. Top of page
  2. Abstract
  3. 1. INTRODUCTION
  4. 2. LINK INFERENCE: THE DYNAMIC NETWORK CLUSTERING APPROACH
  5. 3. COMMUNITY PREDICTION
  6. 4. EXPERIMENTAL EVALUATION
  7. 5. CONCLUSIONS AND SUMMARY
  8. 6. ACKNOWLEDGMENTS
  9. REFERENCES

Network and linked data have become quite prevalent in recent years because of the ubiquity of the web and social media applications, which are inherently network oriented. Such networks are massive, dynamic, and rich in content, and they evolve over time. In this paper, we study the problem of efficient dynamic link inference in temporal and heterogeneous information networks. Efficient dynamic link inference is extremely challenging in massive and heterogeneous information networks because of the dynamic nature of the network and the different types of nodes and attributes in it. Both the topology and type information need to be used effectively for the link inference process. We propose an effective two-level scheme which makes efficient macro- and micro-decisions for combining structure and content in a dynamic and time-sensitive way. The time-sensitive nature of the links is leveraged in order to perform effective link prediction. We also study how to apply the method to the problem of community prediction. We illustrate the effectiveness of our technique over a number of real data sets. Statistical Analysis and Data Mining 2013 DOI: 10.1002/sam.11198


1. INTRODUCTION

In recent years, many forms of networked social media such as Facebook and Flickr have been rapidly burgeoning in terms of their membership and popularity. Many such social and media networks may contain different kinds of attributes such as text, tags or other meta-data, and may rapidly evolve over time. For example, the web, blog networks and social networks are dynamically interconnected with one another, and may continually experience a change in the node and linkage structure of the network. Such dynamic and heterogeneously connected entities are referred to as information networks. This has led to a tremendous interest in the field of managing and mining such dynamic and heterogeneous information networks [2].

In many networks, the linkages arrive continuously over time. This results in a gradual change in the network structure. For example, in a social network, new linkages are continuously created, which often results in gradual densification of the underlying social network graph. In such cases, it may be desirable to predict future linkages between the entities. The derivation of links between entities is an extremely important problem from the perspective of a number of different social networking applications. This has led to increasing interest in the problem of automated inference of the links in social networks [3–9]. We will design efficient link prediction techniques that focus on dynamic, heterogeneous and evolving networks, which contain a combination of different kinds of heterogeneous content and links [10–12]. In this paper, we will design an efficient online algorithm for a dynamic and heterogeneous evolving scenario in which new links are continuously added over time. Scale is a central difficulty: for a network containing 10^7 nodes, the number of possible pairs of nodes is of the order of 10^14. This creates efficiency challenges for predicting the most likely pairs of nodes between which links exist. The challenge is especially great when the link-prediction model needs to be constructed in a dynamic way for heterogeneous networks. In heterogeneous networks, both the nodes and the links can be of different types. Existing techniques are not very effective for cases in which nodes have a very large number of different types with practically arbitrary content structure in different types. For example, in a heterogeneous information network, the nodes may be of different types such as author, conference or paper. Each of these different types may have different kinds of content or attributes. Our model is very flexible, and can apply to practically any kind of dynamic, heterogeneous and content-based network.

The new links in the network may arrive in the context of new nodes being added to the network, or they may correspond to edges between already existing nodes. In some cases, an entire set of nodes together with its associated links may be received by the network, whereas in others, only individual edges or nodes may be received. The two scenarios are not very different from a conceptual point of view, because both can be modeled as the receipt of a single node together with its links. Therefore, all future discussions in this paper will focus on this scenario. One challenge with the dynamic approach is that the structure of the network may evolve over time. This may affect our ability to perform effective link prediction.

In order to achieve these goals, we will use a dynamic graph-clustering approach in which fine-grained clusters are constantly maintained in the network. These clusters are created on the basis of structural similarity. The goal of the clustering process is to create a dynamic summarization which can be efficiently used for inferences in a very large network. The higher level of macro-processing divides the network into regions of high density in which more fine-grained decisions with the use of types and contents can be made. This structural behavior is used for macro-decisions, whereas both structural and attribute behavior are used for the micro-decisions of deciding where the links should be placed. Thus, the link inference decisions are made with the use of a combination of content and links, within a particular structural locality of the network. We will show that such a local approach, which combines the content and structural information in a careful way, is very useful for the case of content-rich and heterogeneous information networks. The approach is efficient, and can be used to perform the inferences in a dynamic way. We will refer to our algorithm as DYNALINK, which reflects the fact that it is a dynamic and heterogeneous content-based link prediction algorithm.

A related problem is that of community prediction, in which we attempt to predict groups of nodes that will form communities in the future. This can be considered a generalization of the link prediction problem, as it finds multiple sets of nodes which are likely to become related to one another in the future. Since our approach for link prediction is designed for dynamic networks, it can be easily generalized to temporal community prediction. Therefore, we will design an algorithm for temporal community prediction, which uses the structure and content of the network in order to make dynamic predictions about the underlying communities in the future.

This paper is organized as follows. The remainder of this section discusses related work and contributions. In Section 2, we will discuss how to leverage the statistics for the problem of link prediction. In Section 3, we will discuss the generalization of the method to the problem of community prediction. Section 4 will study a number of experimental results. Section 5 contains the conclusions and summary.

1.1. Related Work and Contributions

The problem of link prediction has been studied extensively in the data mining and machine learning community [13]. Much of the work on this problem is based on defining proximity-based measures on the nodes in the underlying network [8,14,15]. The work in ref. 8 studied the usefulness of different topological features for link prediction. It was discovered in ref. 5 that none of the features was particularly dominant in different kinds of situations. A second approach is to study the problem in the context of statistical relational models [16–20]. However, these methods are restricted to relational models, and are not designed for dynamic networks, or cannot handle attributes of a relational nature. Recently, the problem of link prediction has also been studied in the context of Wikipedia and web data [3,21]. Methods for using supervised random walks for performing link prediction are proposed in ref. 22.

The link prediction problem has also been studied more generally in the context of the classification problem [5,7], as it can be treated as a classification problem in which features and class labels (corresponding to the existence or absence of links) are associated with the links to be predicted. While some recent work has focused on aspects of the heterogeneous scenario [23–25], and some methods have been proposed for the temporal scenario [22,26], the latter are not designed for the heterogeneous case. Furthermore, we design methods which can perform the link analysis in real time. Methods for performing link prediction across social networks with the use of transfer learning are proposed in refs. 27 and 28. Methods for inferring the labels of network ties with the use of cross-network learning are proposed in ref. 29.

This paper takes a unique approach toward such large-scale dynamic link prediction in networks: topological behavior is used for higher-level decisions through the clustering process, and attribute behavior is used for more fine-grained decisions. This paper also studies the problem of community prediction in social networks. The problem of community detection has also been widely studied in the social networking community [15,30–40]. Many of the methods use a combination of link and content for the community detection process [38–40]. The problem of evolutionary clustering has also been studied in the context of community detection [30–32,37]. However, none of these methods discuss the problem of predicting future evolution of communities in such networks. This paper combines dynamic link and content evolution analysis for community prediction in heterogeneous information networks.

1.2. Link Inference: Problem Definition

In this section, we will define the link inference problem for information networks. We assume that each node has a type associated with it. This type may be quite different depending upon the kind of network. For example, in a paper–authorship network, this type could correspond to paper, author, conference or other corresponding entity. In a movie database, the type could correspond to actor, movie or genre. The links between the different entities represent the nature of the relationships among them. These links could be of different types depending upon the nature of the underlying relationships. For example, a link could be a ‘co-authorship’ relationship between two author nodes, or it could be an ‘authorship’ relationship between an author node and a paper node.

Many of the applications which generate such networks are inherently dynamic. For example, co-authorship networks or military information networks are inherently dynamic in nature. Therefore, it may be assumed that new nodes or edges are constantly being added to the network, and similarly that existing nodes or edges are constantly being deleted. In our paper, we assume that each incoming entity may be a set of nodes together with the edge relationships between them. Furthermore, some of the incoming nodes may never have been encountered before. For example, in a co-authorship network, new authors and papers are continuously being added to the network.

We assume that a node of type r has dr different attributes. The type of the node is itself one of these dr attributes. These values are assumed to be discrete, though numerical values can also be converted to discrete values with the use of a discretization process. We note that there may be nodes of q different types, which are denoted by {1…q}. These values are thus the relational attributes, which represent the properties of the different nodes. Thus, a node i with type ti carries one value for each of the attributes of type ti. These values can often be helpful in link inference, because the correlations in the values across the different nodes can be used for the inference process. We also assume that the domain of values for different attributes is distinct. This assumption is without loss of generality, because we can use a transformation in order to ensure that the values are distinct. Specifically, we can concatenate the following strings in order to create a new attribute-value string: (i) a string containing the attribute name, (ii) the symbol ‘#’, and (iii) the attribute value itself. For example, consider the case when attribute 3 of an author-type node is a keyword, for which one of the possible values is ‘clustering’, and attribute 2 of a paper-type node is also a keyword, for which one of the possible values is ‘clustering’. Then, both values are represented as ‘keyword#clustering’. Similarly, if we track demographic attributes, one of which is gender, then the value of the attribute in a node corresponding to a female would be ‘gender#female’. We index the entire domain of distinct values across all attributes and node types by the integers {1…L}, where L is the number of distinct values. The distinctness of the content values ensures that the attribute values at each node can generally be treated as a bag of values. This is essentially the same as a vector-space representation of text data. In fact, the techniques of this paper are easily generalized to the case in which each node-type contains text content as opposed to a fixed set of discrete attributes for each node type.
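
The transformation above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names `encode_attributes` and `build_index`, and the representation of a node as a list of (attribute name, value) pairs, are assumptions made for the example.

```python
from collections import Counter


def encode_attributes(pairs):
    """Turn (attribute_name, value) pairs into a bag of 'name#value' strings."""
    return Counter(f"{name}#{value}" for name, value in pairs)


def build_index(bags):
    """Map each distinct 'name#value' string to an integer in 1..L."""
    index = {}
    for bag in bags:
        for token in bag:
            index.setdefault(token, len(index) + 1)
    return index


# Hypothetical nodes: an author and a paper that share the keyword 'clustering'.
author = encode_attributes([("keyword", "clustering"), ("gender", "female")])
paper = encode_attributes([("keyword", "clustering")])
index = build_index([author, paper])
```

Because the string is keyed on the attribute name rather than on the attribute position, the keyword of an author node and the keyword of a paper node map to the same index, which is what allows correlations across node types to be counted.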

In addition, for any pair of nodes i and j, a link (i,j) may exist, and the type of the corresponding link is denoted by P(i,j). The type of the corresponding link is drawn from {1…p}. Different kinds of queries can be formulated in the context of this problem. These are as follows:

  • (1) For a given pair of nodes, predict the relative importance of a link arriving between them in the future.
  • (2) For a given node, determine the links of a particular type which are most likely to emanate from that node in the future.
  • (3) Predict all the links of a particular type in the network in the future.

We note that all these queries need to be resolved in a dynamic way which takes into account the evolving structure and content of the network.

2. LINK INFERENCE: THE DYNAMIC NETWORK CLUSTERING APPROACH

One of the key challenges in link inference in heterogeneous information networks is that the link-prediction process requires the use of both linkage and attribute information. Furthermore, the dynamic nature of the network makes the prediction process even more challenging. One observation in ref. 8 is that the topological linkage structure can be leveraged quite effectively for link prediction. In many cases, the attributes may also contain valuable information for the link prediction process, although this information can often be sensitive to a particular locality of the network. For example, in a bibliographic information network, the keyword attribute corresponding to the word ‘network’ may not be very discriminative within the node cluster corresponding to the networking area, but may be quite discriminative within the database or data mining community. Therefore, the link-prediction process can be greatly enhanced with the use of local and context-sensitive information within a particular topological region. In order to leverage the content information in a more discriminative way, we will use a carefully designed approach which is dynamic and properly accounts for the local structural and content information during link prediction. We use the following broad approach:

  • The clustering process is used in order to segment the network into different local regions. Each region is densely populated, and is more likely to contain a larger portion of the links. Furthermore, each local region is likely to provide context-specific linkage behavior of the different attributes.

  • We use the clustered network in conjunction with the relational attributes at different nodes in order to design rules which relate the attribute combinations as well as the local linkage structure to predict the likelihood that a link will arrive in the future between a pair of nodes. As the network is already clustered, the model for the relational attributes is based on the dense linkage structure within a particular region. This is likely to make the model much more robust, because it is based on the local characteristics of the network within a particular region. This sharpens and magnifies the accuracy of the approach.

In addition to the locality-specific advantages of the class discrimination behavior, the clustering process also helps in segmenting the network into regions of much more manageable size. This is essential in a very large information network in which the number of nodes is very large, and there may be too many attribute values to process on a global basis. On the other hand, the set of relevant attribute values within a particular local region may be much more concise. This greatly helps in magnifying the link predictability properties. Once it is known that most of the edges in the network lie within these clustered regions, we can further use the information in the relational attributes in order to make more nuanced predictions on the linkages. The implicit effect of using this two-stage process is to use global linkage skews in order to make the macro-decisions of where most of the links are located, and then the local correlations among the attributes in the individual node types for the micro-decisions for picking the exact pairs of nodes at which to place these links. Therefore, we will describe the overall approach for link prediction by describing the following in the next subsections: (i) We will describe the methodology for creating the clusters as well as the maintenance of corresponding summary statistics. A proper choice of summary statistics is critical for an effective link prediction process. Therefore, we will first describe the summary statistics. (ii) In a later subsection, we will describe how to use these clusters and the associated statistics for link prediction. The described techniques use the summary statistics for the link prediction process.

2.1. Cluster Summary Statistics

The main idea is to create a compact characterization of the linkage behavior which is local to each cluster. The compactness of the characterization is useful in ensuring an efficient link-prediction process. We assume that each cluster consists of a set of nodes C, which are densely connected to one another with the use of links of different types. The summary statistics which are stored with the clusters are as follows:

  • We compute the frequency f(m,C) of the attribute value m ∈ {1…L} over the cluster C. Therefore, f(m,C) is the number of nodes of cluster C in which the attribute value m is present.

  • We maintain the number of links of each type such that both of its end points lie in the cluster C. The number of such links of type k in cluster C is denoted by B(k,C).

  • For all links (i,j) of type k for which both ends lie in the cluster C, we compute the number of occurrences of each attribute value m ∈ {1…L} which are present at the source of the link. If multiple links of type k emanate from i, then the corresponding attribute value is also counted multiple times. This value is aggregated over all nodes i in the cluster. This is the origination frequency of attribute value m for links of type k, and is denoted by O(m,k,C).

  • For all links (i,j) of type k for which both ends lie in the cluster C, we compute the number of occurrences of each attribute value m ∈ {1…L} which are present at the destination of the link. If multiple links of type k are incident to j, then the corresponding attribute value is also counted multiple times. This value is aggregated over all nodes j in the cluster. This is the destination frequency of attribute value m for links of type k, and is denoted by E(m,k,C).

  • For all links (i,j) of type k for which both ends lie in the cluster C, we compute the number of occurrences of the attribute pair m1 ∈ {1…L} and m2 ∈ {1…L}, such that m1 occurs at i and m2 occurs at j. This value is denoted by I(m1,m2,k,C).

  • In addition, the similarity in attribute values across links can also be an indicator of linkage behavior. Therefore, for each attribute value m, we compute the number Qn(m,k,C) of links of type k which have the attribute value m at both ends within cluster C.
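
As a rough illustration of the cluster-level statistics listed above, the following Python sketch computes them in one pass over the links. The data layout (a set of node ids, a dictionary of attribute-value sets, and a list of (source, destination, type) triples) and the function name `cluster_statistics` are assumptions made for the example; a streaming implementation would update the same counters incrementally instead of recomputing them.

```python
from collections import Counter


def cluster_statistics(cluster_nodes, node_attrs, links):
    """Compute f, B, O, E, I and Qn for one cluster.

    cluster_nodes: set of node ids in the cluster
    node_attrs:    {node id: set of attribute values}
    links:         iterable of (source, destination, link_type)
    """
    f, B, O, E, I, Qn = (Counter() for _ in range(6))
    for node in cluster_nodes:
        for m in node_attrs[node]:
            f[m] += 1  # number of cluster nodes carrying value m
    for i, j, k in links:
        if i not in cluster_nodes or j not in cluster_nodes:
            continue  # only links with both end points in the cluster count
        B[k] += 1
        for m1 in node_attrs[i]:
            O[m1, k] += 1  # origination frequency
            for m2 in node_attrs[j]:
                I[m1, m2, k] += 1  # attribute pair across the link
        for m2 in node_attrs[j]:
            E[m2, k] += 1  # destination frequency
        for m in node_attrs[i] & node_attrs[j]:
            Qn[m, k] += 1  # same value at both ends
    return f, B, O, E, I, Qn
```

Each link contributes once per attribute value at its end points, so a node with several outgoing links of type k is counted several times in O, as described in the list above.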

We note that all of the above statistics are based on attribute and link behavior, which are local to a particular cluster. We also maintain global statistics which are true across the different clusters. The main difference is that these statistics are maintained at the global level, rather than simply about the local behavior of particular clusters. These statistics are useful for making predictive decisions about the links, when the end points may lie in different clusters. We track the following analogous statistics:

  • The number of nodes at which the attribute m occurs globally is denoted by h(m).

  • The number of links of type k in the entire network is denoted by A(k).

  • For all links (i,j) of type k in the network, we compute the number of occurrences of each attribute value m ∈ {1…L} which are present at the source of the link. If multiple links of type k emanate from i, then the corresponding attribute value is also counted multiple times. This value is aggregated over all nodes i in the network. This is the origination frequency of attribute value m for links of type k, and is denoted by OG(m,k).

  • For all links (i,j) of type k in the network, we compute the number of occurrences of each attribute value m ∈ {1…L} which are present at the destination of the link. If multiple links of type k are incident to j, then the corresponding attribute value is also counted multiple times. This value is aggregated over all nodes j in the network. This is the destination frequency of attribute value m for links of type k, and is denoted by EG(m,k).

  • For all links (i,j) of type k for which the source i and destination j lie in different clusters, we compute the number of occurrences of the attribute pair m1 ∈ {1…L} and m2 ∈ {1…L}, such that m1 occurs at i and m2 occurs at j. This value is denoted by J(m1,m2,k).

  • For each attribute value m, we compute the number Pn(m,k) of links of type k which have the same attribute value m at both ends.

We will use these summary statistics in conjunction with the clustering process in order to accomplish the link prediction process. We note that these statistics are maintained dynamically along with the clusters. In the next section, we will discuss the process of dynamic cluster maintenance.

2.2. Dynamic Cluster and Statistics Maintenance

In this section, we will propose techniques for dynamic cluster maintenance with a focus on link prediction. The clustering process partitions the node set N into a group of clusters, C1…Cr. Because the link inference method of this paper is designed for dynamic information networks, the clustering process needs to be dynamic as well. As an initialization step, we start off with the initial state of the information network clusters which are derived with the use of any of the standard node clustering algorithms [41]. For the purpose of this paper, we will use a simple graph partitioning algorithm which divides nodes into r partitions, so as to minimize the number of inter-partition edges. A classic example of this is the Kernighan-Lin algorithm [41]. Subsequently, we need a method to maintain both the clustering structure and the link statistics in the presence of dynamic changes of the information network. We assume that this dynamic nature of the information network is reflected by incoming graph objects, each of which may have a set of nodes and links. For example, in a co-authorship network, each object may correspond to a research paper, in which the nodes are papers, authors or conferences. The links may represent an authorship relationship or paper–conference relationships. It is possible that some of these nodes may not currently be present in the information network at all. For example, when a paper is written by an author who has not published before, this corresponds to a completely new node. We note that the incoming graph objects may not immediately affect the clustering structure of the underlying nodes, which are already present in the network. However, the addition of such objects may result in the movement of nodes from one partition to another. Such changes may happen over time, when the structure of the network changes.
While the network structure may be very large, its detailed structure needs to be tracked during the clustering process. Specifically, we need to maintain information about the nodes, their attribute values and their adjacent nodes. For this purpose, the adjacency list representation can be used very efficiently.

We dynamically maintain the set of clusters C1…Cr. When a new graph Gr arrives, those of its nodes which are already present in the information network are assumed to belong to their current clusters. The new nodes are greedily assigned to the clusters which result in the least number of links across the different clusters. After this assignment process, we update the inter-cluster and intra-cluster link-based statistics of the previous section. In many instances, the network structure changes over time, and therefore the assignment of nodes to clusters may change as well. For this purpose, we consider only the nodes involved in the current object as candidates for reassignment. We process the nodes of the current object in random order and check, for each node, whether a reassignment to any of the other clusters reduces the number of links across the different clusters. If such is the case, then the reassignment is performed. We note that this step may also result in adjustment of inter-cluster and intra-cluster statistics. It is fairly straightforward to update these summary statistics by using the disk-resident representation to examine the attributes of the node and its outgoing links. The reassignment of clusters changes the links within the clusters as well as the links across the clusters. Correspondingly, the statistics are also modified in order to reflect this readjustment. The overall update process for an incoming object is illustrated in Fig. 1.
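
The assignment and reassignment steps described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' procedure: the names `best_cluster` and `process_object`, the adjacency-list dictionary `neighbors`, and the tie-breaking rule (lowest cluster id) are all hypothetical, and the update of the summary statistics is omitted.

```python
from collections import Counter


def best_cluster(node, neighbors, cluster_of, num_clusters):
    """Return the cluster holding most of the node's already-assigned neighbors.

    Placing the node there minimizes the number of its links that cross
    cluster boundaries. Ties go to the lowest cluster id (an assumption).
    """
    counts = Counter(cluster_of[v] for v in neighbors[node] if v in cluster_of)
    return max(range(num_clusters), key=lambda c: (counts[c], -c))


def process_object(object_nodes, neighbors, cluster_of, num_clusters, rng):
    """Assign new nodes greedily, then try to reassign the object's nodes."""
    for node in object_nodes:
        if node not in cluster_of:
            cluster_of[node] = best_cluster(node, neighbors, cluster_of, num_clusters)
    order = list(object_nodes)
    rng.shuffle(order)  # the nodes are examined in random order
    for node in order:
        target = best_cluster(node, neighbors, cluster_of, num_clusters)
        current = cluster_of[node]
        in_target = sum(cluster_of.get(v) == target for v in neighbors[node])
        in_current = sum(cluster_of.get(v) == current for v in neighbors[node])
        if in_target > in_current:  # move only if cross-cluster links decrease
            cluster_of[node] = target
    return cluster_of
```

Only the nodes of the incoming object are examined, so the cost of one update depends on the size of the object and the degrees of its nodes rather than on the size of the whole network.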

Figure 1. Incremental Clustering Process

2.3. Computing Content Predictability

In this section, we will discuss how to leverage the statistics collected in the aforementioned sections for the purpose of link prediction. In order to achieve this goal, we will first construct rules which relate the attribute values at the source and destination of a link to the probability that the link exists. For this purpose, the summary statistics maintained in each cluster are very useful. Therefore, we define the concept of local predictability of links with respect to particular attribute pairs.

DEFINITION 1: (Local Predictability) The local predictability S(m1,m2,k,C) of attribute pair (m1,m2) and link type k with respect to the cluster C is the probability that, for a given node pair (i,j) completely contained within cluster C, the link (i,j) of type k exists, conditional on the fact that node i contains attribute value m1 and node j contains attribute value m2. This local predictability is estimated as a weighted average of four quantities, for weights α1…α4, which satisfy α1 + α2 + α3 + α4 = 1:

  • equation image

We note that the weights α1…α4 can be learned by testing over a grid of values and picking the optimum combination on a small held-out portion of the training data. In the event that any of the fractions above is indeterminate¹, we exclude that term from the computation.

Each of these terms corresponds to a different way of using the attribute structure in order to estimate the local link probability. A detailed explanation for each of these terms is as follows:

  • The first fraction equation image performs the estimation by analyzing the behavior of the links which emanate from a node containing a particular attribute type.

  • The second fraction performs the estimation by analyzing the behavior of the links which are incident on a node containing a particular attribute type.

  • The third fraction uses both the source and destination behavior of a particular node.

  • The last fraction uses only the frequency behavior of the links in the cluster and does not use attribute structure at all. This is useful in cases where much information about the behavior of a particular kind of link may not be available.
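
One form of the weighted average that is consistent with the four term descriptions above is sketched below, writing \mathcal{C} for the cluster. The numerators are the cluster statistics of Section 2.1; the denominators are an assumption of this sketch, chosen so that each fraction divides a link count by the corresponding number of candidate node pairs. The exact normalization used by the method may differ.

```latex
S(m_1, m_2, k, \mathcal{C}) \approx
    \alpha_1 \frac{O(m_1, k, \mathcal{C})}{f(m_1, \mathcal{C}) \cdot |\mathcal{C}|}
  + \alpha_2 \frac{E(m_2, k, \mathcal{C})}{f(m_2, \mathcal{C}) \cdot |\mathcal{C}|}
  + \alpha_3 \frac{I(m_1, m_2, k, \mathcal{C})}{f(m_1, \mathcal{C}) \cdot f(m_2, \mathcal{C})}
  + \alpha_4 \frac{B(k, \mathcal{C})}{|\mathcal{C}|^2},
\qquad \alpha_1 + \alpha_2 + \alpha_3 + \alpha_4 = 1
```

Under this reading, a fraction becomes indeterminate exactly when one of the attribute values does not occur in the cluster, so that both its numerator and its denominator are zero.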

A similar concept can be defined in terms of the global predictability G(m1,m2,k) of the attribute pair (m1,m2), with respect to link type k.

DEFINITION 2: (Global Predictability) The global predictability G(m1,m2,k) of attribute pair (m1,m2) and link type k is the probability that, for a given node pair (i,j), the link (i,j) of type k exists, conditional on the fact that node i contains attribute value m1 and node j contains attribute value m2. Let N be the total number of nodes currently in the network. This probability is estimated as a weighted average of four fractions, with weights β1…β4, which satisfy β1 + β2 + β3 + β4 = 1:

  • equation image

As in the previous case, the values of β1…β4 can be learned by testing over a grid of values and picking the optimum combination. Note that the global predictability is useful in capturing the behavior of those links for which the end points do not lie in the same cluster. The concept of predictability essentially defines rules for the link-prediction process.
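
The grid-based choice of weights can be illustrated with a short Python sketch. Several details here are assumptions rather than part of the method as stated: the function names, the representation of each of the four terms as a (numerator, denominator) pair, the grid spacing, and the use of squared error on held-out node pairs as the selection criterion.

```python
from itertools import product


def weight_grid(steps):
    """Yield all 4-tuples of weights on a grid of spacing 1/steps that sum to 1."""
    for a, b, c in product(range(steps + 1), repeat=3):
        d = steps - a - b - c
        if d >= 0:
            yield (a / steps, b / steps, c / steps, d / steps)


def combine(fractions, weights):
    """Weighted sum of (numerator, denominator) pairs, skipping indeterminate terms."""
    total = 0.0
    for (num, den), w in zip(fractions, weights):
        if den == 0:
            continue  # a zero denominator marks an indeterminate term
        total += w * num / den
    return total


def pick_weights(examples, steps=4):
    """Choose the grid point with the lowest squared error on held-out pairs.

    examples: list of (fractions, label), with label 1 if the link exists.
    The squared-error criterion is an assumption of this sketch.
    """
    def error(weights):
        return sum((combine(fr, weights) - label) ** 2 for fr, label in examples)

    return min(weight_grid(steps), key=error)
```

The same routine can be applied to either set of weights, since the local and global estimates differ only in the statistics that supply the four fractions.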

Therefore, the first step is to determine the values of S(m1,m2,k,C) and G(m1,m2,k), for each attribute-value pair (m1,m2), and sort them in descending order. Note that this is done offline periodically in the case of a dynamic network, because it may be time-consuming to compute this statistic over all pairs of attribute values (m1,m2).

In addition, we determine the discriminatory attribute values which are based on content similarity. The locally discriminatory attribute values for cluster C and link type k, denoted by La(k,C), are all the attribute values m for which the value of Qn(m,k,C)/f(m,C)^2 is larger than the mean value over all the attribute values. Similarly, we define Pa(k) as the set of all attribute values for which the value of Pn(m,k)/h(m)^2 is larger than the mean value over all attribute values. Note that La(k,C) and Pa(k) define attribute values for which similarity between nodes also implies a higher probability of a link.
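
The selection of discriminatory attribute values amounts to a simple threshold on a ratio, as the following Python sketch shows. The function name and the dictionary inputs are assumptions for the example; values that never occur (frequency zero) are skipped here because their ratio is undefined, which is also an assumption.

```python
def discriminatory_attributes(same_value_links, frequency):
    """Return attribute values whose Qn / f^2 ratio exceeds the mean ratio.

    same_value_links: {m: number of type-k links with value m at both ends}
    frequency:        {m: number of nodes carrying value m}
    """
    ratios = {
        m: same_value_links.get(m, 0) / frequency[m] ** 2
        for m in frequency
        if frequency[m] > 0
    }
    if not ratios:
        return set()
    mean = sum(ratios.values()) / len(ratios)
    return {m for m, ratio in ratios.items() if ratio > mean}
```

Passing the cluster-level counts Qn and f yields the local set, and passing the global counts Pn and h yields the global set.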

2.4. Dynamic Structural Measures

In addition to the local content-based similarity measures, we also calculate the pairwise structural similarity between nodes. It is important to note that while pairwise similarity measures between attributes are on the basis of content, the pairwise similarity measures between nodes are on the basis of structure. The pairwise structural similarity between nodes is computed as a weighted function of the following quantities: (i) the decay-weighted number of links between the two entities and (ii) the decay-weighted similarity in neighbors between the two entities.

In order to enable efficient and dynamic computation of the link prediction process, we do not use more complex structural measures such as path lengths between nodes. As the techniques discussed in this paper are designed for the case of a dynamic network, it is important to use temporal decay in the process of modeling the number of links between the two entities. We did not use the decay behavior in the content-based similarity because we generally found the content behavior across the network to be much more stable with time as compared with the structural behavior. Therefore, it is more critical to use the decay behavior in structural computations as compared to the content computations. We define the decay-weighted frequency of a link as follows:

DEFINITION 3: Let t be the current time, and t1…tr be the time stamps at which links between a particular pair of nodes (i,j) were received. Then, the decay-weighted frequency DF(i,j,k,t) of link (i,j) of type k at time t is defined as equation image. Here λ is the decay parameter.

The decay-based frequencies of the neighbors of each node are dynamically tracked over time. We keep track of only the neighbors of each node which have nonzero frequency. We note that this can be a challenge, because the decay-based frequencies are continuously changing at each tick, and we do not want to update all the frequencies at a given time. However, the update can be performed in a lazy fashion, as the decay-based frequencies for all nodes decay at the same rate unless a new link is added. This follows from the following observation:

OBSERVATION 1: If the link (i,j) is not received in the time interval [t1,t2], then we have:

  • equation image(1)

Therefore, we make the multiplicative update for the decay function only when a new link is added. If ts is the last time at which a link (i,j) of type k was received, and tc is the current time at which a link is received, then we update the decay-based frequencies only at times ts and tc. At the current time tc, we first multiply the link frequency DF(i,j,k,ts) by equation image, and then add 1.
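The lazy update can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments: it assumes an exponential decay factor exp(-λ·Δt), which is an assumption because the exact decay expression is not reproduced in the text, and the class and method names are hypothetical.

```python
import math


class DecayedLinkFrequency:
    """Decay-weighted frequency of one link (i, j) of one type, updated lazily."""

    def __init__(self, decay_rate):
        self.decay_rate = decay_rate  # the decay parameter lambda
        self.value = 0.0              # DF as of the last received link
        self.last_update = None       # time stamp ts of the last received link

    def _decay_factor(self, elapsed):
        # ASSUMPTION: exponential decay; the original base may differ.
        return math.exp(-self.decay_rate * elapsed)

    def add_link(self, now):
        """Receive a link at time `now`: decay the stored value, then add 1."""
        if self.last_update is not None:
            self.value *= self._decay_factor(now - self.last_update)
        self.value += 1.0
        self.last_update = now

    def read(self, now):
        """Current frequency at time `now`, without modifying the stored state."""
        if self.last_update is None:
            return 0.0
        return self.value * self._decay_factor(now - self.last_update)


stamps = (0.0, 1.0, 3.0)
df = DecayedLinkFrequency(decay_rate=1.0)
for ts in stamps:
    df.add_link(ts)

lazy = df.read(4.0)
direct = sum(math.exp(-1.0 * (4.0 - ts)) for ts in stamps)
print(lazy, direct)
```

The stored value is touched only when a link arrives, yet reading it at any later time agrees with summing the decayed contribution of every time stamp directly.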

Thus, for each link type, each node has a vector of decay frequencies which are dynamically maintained along with it. The length of this vector is essentially the number of neighbors of that node which are based on links of type k. We define the structural similarity vector at a node as follows:

DEFINITION 4: The structural similarity vector of a node i at time tc for links of type k is the set NS(i,k) of neighbors of that node together with the value of DF(i,j,k,tc) for each j ∈ NS(i,k).

We further note that in many cases, the decay process will ensure that some components of this vector become smaller and smaller over time. These correspond to nodes which may have been neighbors at some point, but have not been active neighbors for a while. Such components do not contribute much to the computation, but they increase the space and time requirements. Therefore, it is best to prune them: at the time of updating a node, we check all the components of the vector of that node, and remove all components which are less than 0.1% of the magnitude of the average component in it. We note that because such networks are typically sparse (and an even smaller percentage of the links are active), the vector maintained at each node is very small. Thus, for each node, we maintain a list of the neighbor nodes with a nonzero component of the decay frequency, and the actual value of the decay frequency. Then, the structural similarity between a pair of nodes for links of type k at time tc can be computed in the form of the following two measures:

  • The first measure is the direct structural similarity DF(i,j,k,tc).

  • The indirect structural similarity is the dot product of the structural similarity vectors of i and j for links of type k. This number is denoted by IDF(i,j,k,tc). In other words, if equation image and equation image are the vectors at i and j respectively, then the dot product is given by:

    • equation image(2)
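A minimal sketch of the pruning step and of the indirect structural similarity is given below, under the assumption that each structural similarity vector is stored as a sparse dictionary from neighbor to decay-weighted frequency. The function names are hypothetical.

```python
def prune(vector, fraction=0.001):
    """Drop components below 0.1% of the average component of the vector."""
    if not vector:
        return {}
    cutoff = fraction * sum(vector.values()) / len(vector)
    return {j: w for j, w in vector.items() if w >= cutoff}


def indirect_similarity(vec_i, vec_j):
    """Dot product of two sparse vectors; only shared neighbors contribute."""
    if len(vec_i) > len(vec_j):
        vec_i, vec_j = vec_j, vec_i  # iterate over the shorter vector
    return sum(w * vec_j[n] for n, w in vec_i.items() if n in vec_j)


vec_a = {"b": 2.0, "c": 0.5, "d": 0.0001}  # "d" has become inactive
vec_e = {"b": 1.0, "c": 4.0, "f": 3.0}

pruned_a = prune(vec_a)
idf = indirect_similarity(pruned_a, vec_e)
print(pruned_a, idf)
```

Because the vectors are sparse and pruned, the cost of the dot product is proportional to the number of active neighbors of the smaller node rather than to the size of the network.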

2.5. Queries

The statistics which are computed above can be leveraged for an effective link prediction process. We describe the techniques below:

Query 1: Determine the predictability-score of a link of type k between a particular pair of nodes i and j.

We note that the predictability score is a number which helps in the relative rankings of linkages between nodes, rather than serving as a true indicator of predictability values. There are several factors which are combined in order to compute the final predictability score. These factors are as follows: (i) the content-based predictability, (ii) the content similarity and (iii) the (direct and indirect) structural similarity.

In order to resolve this query, we first determine the sets of attribute values V(i) and V(j) present at nodes i and j. In addition, we determine the cluster memberships of nodes i and j respectively. In the event that the cluster memberships of nodes i and j are not the same, we use the global predictability G(m1,m2,k) for each attribute-value pair m1 ∈ V(i) and m2 ∈ V(j). The average of the top t predictability values among the different pairs is computed as a first step in order to create the predictability score. On the other hand, if the nodes i and j belong to the same cluster equation image, then we repeat the same computation with the use of the local predictability values S(m1,m2,k,equation image). This component defines the content-based predictability and is denoted by CP(i,j,k,tc) at the current time tc.
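The computation of the content-based predictability can be sketched as follows. The lookup tables, the function names, and the choice of scoring unseen attribute pairs as 0 are illustrative assumptions, not details taken from the method itself.

```python
import heapq
from itertools import product


def choose_table(cluster_i, cluster_j, local_tables, global_table):
    """Use the local table S when both nodes share a cluster, else the global G."""
    if cluster_i == cluster_j:
        return local_tables[cluster_i]
    return global_table


def content_predictability(attrs_i, attrs_j, table, top_t):
    """Average of the top_t predictability values over all attribute pairs."""
    # ASSUMPTION: attribute pairs missing from the table contribute 0.
    scores = [table.get(pair, 0.0) for pair in product(attrs_i, attrs_j)]
    best = heapq.nlargest(top_t, scores)
    return sum(best) / len(best) if best else 0.0


global_table = {
    ("graph", "mining"): 0.4,
    ("graph", "stream"): 0.1,
    ("link", "mining"): 0.3,
}
local_tables = {0: {("graph", "mining"): 0.8}}

# Nodes in different clusters (0 and 1) fall back to the global table.
table = choose_table(0, 1, local_tables, global_table)
cp = content_predictability(["graph", "link"], ["mining", "stream"], table, top_t=2)
print(cp)
```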

Furthermore, we include a contribution for the similarity in attribute values between the node pairs. Specifically, we add the cosine similarity between V(i) ∩ Pa(k) and V(j) ∩ Pa(k) to the predictability score, or the cosine similarity between V(i) ∩ La(k,equation image) and V(j) ∩ La(k,equation image) if the cluster memberships of nodes i and j are the same. This value is denoted by CS(i,j,k,tc) at the current time tc.
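As a hedged sketch, the content similarity can be computed as below when attribute values are treated as unweighted sets, in which case the cosine similarity reduces to the overlap divided by the geometric mean of the set sizes. Treating the attributes as binary is an assumption, and the function name is hypothetical.

```python
import math


def restricted_cosine(attrs_i, attrs_j, discriminatory):
    """Cosine similarity of two attribute sets restricted to discriminatory values."""
    a = set(attrs_i) & discriminatory
    b = set(attrs_j) & discriminatory
    if not a or not b:
        return 0.0
    # For binary vectors: dot product = |a & b|, norms = sqrt(|a|), sqrt(|b|).
    return len(a & b) / math.sqrt(len(a) * len(b))


pa_k = {"graph", "mining", "stream"}  # stands in for Pa(k) or La(k, cluster)
cs = restricted_cosine({"graph", "mining", "the"}, {"mining", "stream", "of"}, pa_k)
print(cs)
```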

Finally, we also have the direct and indirect structural similarity components. These structural similarity values are denoted by DF(i,j,k,tc) and IDF(i,j,k,tc). Then, the total link prediction score TPS(i,j,k,tc) is defined as a weighted sum of these different components with the use of balance parameters γ1…γ4, as follows:

  • equation image

The values of the balance parameters γ1…γ4 are chosen in a data-driven manner: a small grid of candidate values is tested on a small part of the training data, and the combination that performs best there is then used on the test data.
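The weighted combination and the grid search can be illustrated with the following sketch. The order in which γ1…γ4 are paired with CP, CS, DF and IDF, the restriction of the grid to weights that sum to 1, and the use of precision on a held-out sample as the selection criterion are all assumptions made for the example; the function names are hypothetical.

```python
from itertools import product


def total_prediction_score(cp, cs, df, idf, gammas):
    """Weighted sum of the four components (assumed order: CP, CS, DF, IDF)."""
    g1, g2, g3, g4 = gammas
    return g1 * cp + g2 * cs + g3 * df + g4 * idf


def weight_grid(step=0.1):
    """Yield all (g1, g2, g3, g4) on a grid whose entries sum to 1 (assumption)."""
    n = round(1 / step)
    for a, b, c in product(range(n + 1), repeat=3):
        d = n - a - b - c
        if d >= 0:
            yield (a / n, b / n, c / n, d / n)


def precision_at(examples, gammas, n):
    """Fraction of true links among the n highest-scoring held-out pairs."""
    ranked = sorted(
        examples,
        key=lambda e: total_prediction_score(*e[0], gammas),
        reverse=True,
    )
    return sum(label for _, label in ranked[:n]) / n


# Each held-out example: ((CP, CS, DF, IDF), label), label 1 for a true link.
examples = [
    ((0.9, 0.1, 0.0, 0.0), 0),
    ((0.1, 0.2, 2.0, 1.0), 1),
    ((0.8, 0.9, 0.1, 0.0), 0),
    ((0.2, 0.1, 1.5, 3.0), 1),
]
best = max(weight_grid(0.1), key=lambda g: precision_at(examples, g, n=2))
print(best, precision_at(examples, best, n=2))
```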

Query 2: Determine the q most likely links of type k emanating from node i.

In this case, the response to the query is in the form of a ranked list of links emanating from node i. One possible way to resolve this query is to repeat query 1 over all possible node pairs that contain node i. This can, however, be time-consuming, since the number of nodes in the information network can be very large. Therefore, a natural way to resolve the query is to first identify a small structural locality of the network based on the decay-based values DF(i,j,k,tc). We first determine all nodes in the network which are within a distance of at most h from node i, traversing only links for which DF(i,j,k,tc) > φ, where φ is a small number such as 0.1. This effectively uses only the active neighbors of each node in the exploration process. The value of h is typically a small number such as 2 or 3. Once these nodes have been identified, we can repeat the process of query 1 directly on this much smaller subset of nodes.
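A minimal sketch of this locality search is shown below, assuming the decay-weighted frequencies are stored as a nested dictionary. The names active_locality and top_links_from, and the scoring callback that stands in for query 1, are hypothetical.

```python
import heapq
from collections import deque


def active_locality(df, source, max_hops=2, phi=0.1):
    """Nodes within max_hops of source, following only links with DF > phi.

    df[u][v] holds the current decay-weighted frequency of link (u, v).
    """
    hops = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor, weight in df.get(node, {}).items():
            if weight > phi and neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    return set(hops) - {source}


def top_links_from(df, source, score, q, max_hops=2, phi=0.1):
    """Rank only the nodes in the active locality (query 1 on a small subset)."""
    candidates = active_locality(df, source, max_hops, phi)
    return heapq.nlargest(q, candidates, key=lambda j: score(source, j))


df = {
    "a": {"b": 0.9, "c": 0.05},
    "b": {"a": 0.9, "d": 0.4},
    "c": {"a": 0.05, "e": 2.0},
    "d": {"b": 0.4, "f": 0.7},
    "e": {"c": 2.0},
    "f": {"d": 0.7},
}
toy_scores = {"b": 0.3, "d": 0.8}  # stand-in for the total prediction score
locality = active_locality(df, "a")
best = top_links_from(df, "a", lambda i, j: toy_scores[j], q=1)
print(locality, best)
```

In the toy graph, node c is skipped because its link to a has decayed below φ, and node f is skipped because it lies three hops away.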

Query 3: Determine the q most likely links of type k.

The naive way of solving this problem would be to apply query 1 over all pairs of nodes. However, this can be extremely inefficient, as the number of node pairs grows quadratically with the number of nodes, which is potentially large. As in the previous case, we construct a network which is based on edges (i,j) for which the value of DF(i,j,k,tc) is at least φ. For each node i, we compute the aggregate value of DF(i,j,k,tc) over all nodes j adjacent to it. We process the nodes in decreasing order of this aggregate value, and repeat the process of query 2 in order to determine the most likely links. We dynamically keep track of the q most likely links. The processing of each node may lead to some new links which join the set of q most likely links. However, as more and more nodes are processed, the set of most likely links is updated less frequently. We terminate when no update happens in t consecutive iterations. We note that this is a heuristic termination criterion, but at large values of t, such as 1% of the number of nodes, it provides an effective solution.
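The heuristic can be sketched as follows. The callback links_from stands in for query 2, the parameter patience plays the role of t, and the treatment of links as undirected pairs is an assumption; all names are hypothetical.

```python
import heapq


def top_links_global(df, links_from, q, patience, phi=0.1):
    """Keep the q best links, visiting nodes by decreasing aggregate DF.

    links_from(i) yields (score, j) candidates for node i, as in query 2.
    Stops after `patience` consecutive nodes that do not change the top-q set.
    """
    def aggregate(i):
        return sum(w for w in df[i].values() if w >= phi)

    best = []     # min-heap of (score, pair); best[0] is the weakest kept link
    kept = set()  # pairs currently in the heap
    idle = 0
    for i in sorted(df, key=aggregate, reverse=True):
        updated = False
        for score, j in links_from(i):
            pair = (min(i, j), max(i, j))  # ASSUMPTION: undirected links
            if pair in kept:
                continue
            if len(best) < q:
                heapq.heappush(best, (score, pair))
                kept.add(pair)
                updated = True
            elif score > best[0][0]:
                _, dropped = heapq.heapreplace(best, (score, pair))
                kept.discard(dropped)
                kept.add(pair)
                updated = True
        idle = 0 if updated else idle + 1
        if idle >= patience:
            break
    return sorted(best, reverse=True)


df = {"a": {"b": 0.9}, "b": {"a": 0.9, "c": 0.4}, "c": {"b": 0.4}, "d": {}}
candidates = {
    "a": [(0.7, "c")],
    "b": [(0.2, "d")],
    "c": [(0.7, "a"), (0.5, "d")],
    "d": [],
}
top = top_links_global(df, lambda i: candidates[i], q=2, patience=2)
print(top)
```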

3. COMMUNITY PREDICTION

The link prediction problem is closely related to the problem of community prediction. In the link prediction problem, we attempt to determine the most likely links which would occur in the future. In the community prediction problem, the goal is to determine the communities which are most likely to occur in the future. While community prediction is somewhat related to evolutionary clustering [28], the former is forward looking in its scope of predicting communities which are likely to form in the future, whereas evolutionary clustering is backward looking in terms of looking at the past changes in the clustering behavior of the data.

In this case, we will perform the community prediction with links of a particular type k at time tc. As before, the total prediction score TPS(i,j,k,tc) can be used in order to predict the future communities at time tc. Ideally, we would like to partition the nodes into a set of clusters, such that the weight of the intra-cluster links is as large as possible, when measured in terms of TPS(i,j,k,tc). Therefore, if we want to partition the data into η different communities, the problem of community prediction can be re-formulated as follows:

DEFINITION 5: [Community Prediction] Partition the node set N into η different clusters equation image, so that the following objective function O(k,tc) is maximized:

  • equation image(3)

We note that the objective function O(k,tc) represents the quality of the communities equation image, when expressed in terms of links of type k at time tc. The maximization of this objective function determines the communities with the use of links of type k only. We note that this formulation does not include the predictability scores for node pairs for which the underlying edges already exist in the network at time tc. This ensures that the communities are constructed on the basis of new edges which are forming in the network, rather than edges which are already present in the network at time tc. The aforementioned formulation generalizes easily to the case of homogeneous networks, by treating all links as if they are of the same type.

One possible solution to this problem is to compute the value of TPS(i,j,k,tc) for each pair of nodes i and j, and use any agglomerative algorithm for the clustering process. However, such an approach may be too slow in practice for very large networks. This is because we would need to compute TPS(i,j,k,tc) for every pair of nodes i and j. The number of such pairs scales with the square of the number of nodes. This can be computationally very expensive, when the number of nodes in the network is too large. Another problem with the above formulation is that it does not address the issue of overlaps among the different communities. Typically, a community may evolve in a variety of ways, and the same node may be included in different communities along different lines of evolution. Therefore, it makes sense to determine the locally evolving communities centered at a given node. In many cases, the same node may be included in different communities, which are centered around different nodes. If desired, this process can be used in order to determine the overall communities by repeatedly sampling nodes from the network, and building the communities around them.

For the process of local community formation, an additional input to the algorithm is the identity of the node around which the community needs to be constructed. Let i be the node around which the community is predicted. We identify a small structural locality of the network based on the decay-based values DF(i,j,k,tc) around the node i. We first determine all nodes in the network which are within a distance of at most h from node i, traversing only links for which DF(i,j,k,tc) > φ, where φ is a small number such as 0.1. The value of h is a small number such as 2 or 3. Let S be the set of nodes which have thus been identified. Among all pairs of nodes (u,v) in this set, we determine those pairs for which the value of TPS(u,v,k,tc) is larger than a predetermined threshold Φ. We create a new subnetwork containing only edges for which the value of TPS(u,v,k,tc) is larger than the threshold Φ. Furthermore, we remove those edges which already exist in the network at time tc. We determine the subnetwork which is connected to i with the use of these edges. This creates the local community containing node i.
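The construction of a local community can be sketched as follows, assuming the structural locality S has already been computed and that a scoring function for the total prediction score is available. The function name, the argument names and the representation of existing edges as frozensets are hypothetical.

```python
from itertools import combinations


def local_community(center, locality, tps, existing_edges, threshold):
    """Connected component of `center` in the graph of predicted new links.

    locality: the node set S around the center; tps(u, v): prediction score;
    existing_edges: set of frozenset({u, v}) already present at time tc.
    """
    nodes = set(locality) | {center}
    adjacency = {n: set() for n in nodes}
    for u, v in combinations(nodes, 2):
        if frozenset((u, v)) in existing_edges:
            continue  # only newly forming links define the community
        if tps(u, v) > threshold:
            adjacency[u].add(v)
            adjacency[v].add(u)

    # Depth-first search from the center over the thresholded edges.
    community, stack = {center}, [center]
    while stack:
        for neighbor in adjacency[stack.pop()]:
            if neighbor not in community:
                community.add(neighbor)
                stack.append(neighbor)
    return community


scores = {
    frozenset(("a", "b")): 0.95,  # high score, but the edge already exists
    frozenset(("a", "c")): 0.9,
    frozenset(("c", "d")): 0.8,
    frozenset(("b", "d")): 0.2,
}
community = local_community(
    center="a",
    locality={"b", "c", "d"},
    tps=lambda u, v: scores.get(frozenset((u, v)), 0.0),
    existing_edges={frozenset(("a", "b"))},
    threshold=0.5,
)
print(sorted(community))
```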

This approach can also be used to provide a rough idea of the globally forming communities in the network by repeatedly sampling nodes, and building the communities around these nodes. A node is sampled only if it is not already included in some community in the network. The process is repeated, until a majority (say 80%) of the nodes in the network have been covered by at least one community. We note that such an approach will yield globally forming communities which are highly overlapping in nature.
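The sampling loop can be sketched as follows. The callback build_local_community stands in for the local procedure described earlier, and the uniform sampling of uncovered nodes and the fixed random seed are assumptions made for reproducibility of the example.

```python
import random


def sample_communities(nodes, build_local_community, coverage=0.8, seed=0):
    """Grow communities around sampled uncovered nodes until coverage is reached."""
    rng = random.Random(seed)
    uncovered = set(nodes)
    allowed_uncovered = (1.0 - coverage) * len(uncovered)
    communities = []
    while len(uncovered) > allowed_uncovered:
        # Sample only nodes not yet included in any community.
        center = rng.choice(sorted(uncovered))
        community = set(build_local_community(center)) | {center}
        communities.append((center, community))
        uncovered -= community
    return communities


# Toy stand-in: each community is the center plus its successor on a ring.
communities = sample_communities(range(10), lambda c: {c, (c + 1) % 10})
covered = set().union(*(members for _, members in communities))
print(len(communities), len(covered))
```

Each iteration covers at least the sampled center, so the loop always terminates; communities may still overlap, because a community built around a new center is free to include nodes that are already covered.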

4. EXPERIMENTAL EVALUATION

In this section, we will test the effectiveness and efficiency of our proposed DYNALINK algorithm on a number of real data sets. We will first give a description of the data sets, and discuss the experimental setup. Finally, we will present the experimental results. For the purpose of comparison, we use three baselines: the first is a flow-based prediction method (PROPFLOW) proposed in ref. 31; the second is a chance-constrained link prediction formulation (CBSOCP) introduced in ref. 9; and the last is a feature-similarity-based method (ADAMIC-ADAR) as in ref. 11. As we will see later, our experimental studies illustrate that our approach can effectively predict future links in a dynamic scenario on both heterogeneous and homogeneous graphs. Furthermore, in spite of the greater generality of the DYNALINK algorithm, it is much more effective and efficient even in the homogeneous scenarios for which the baseline methods are designed.

4.1. Data Sets

The algorithms were tested on three real data sets, which are similar to the ones used for the baseline algorithm described in ref. 9. The main difference is that a heterogeneous network structure was derived from some of the data sets in order to test the effectiveness of the DYNALINK algorithm in this scenario. Furthermore, a dynamic environment was simulated to test the dynamic aspects of our algorithm. The three data sets used are described below.

The first data set is a heterogeneous co-authorship network derived from the well-known DBLP data set (footnote 2). We extract all the papers published in 20 conferences related to database, data mining, information retrieval and machine learning from 1996 to 2009.

The other two are the Genetics and Biochemistry data sets, which are derived from the popular PubMed database (footnote 3). In particular, the Genetics data set includes a collection of publications in 14 journals related to genetics and molecular biology from 1996 to 2005, while the Biochemistry data set contains articles published in 5 journals related to biochemistry, also from 1996 to 2005. The sizes of the three data sets are summarized in Table 1.

Table 1. Data description.
Data set        # Papers    # Authors    # Edges
DBLP            23 329      25 950       107 997
Genetics        11 463      41 868       159 746
Biochemistry    14 151      49 982       184 029

4.2. Experimental Setup

As discussed earlier, our proposed approach focuses on predicting links in dynamic information networks in a time-sensitive way. In order to simulate this dynamic scenario, we divide each data set into three parts as follows:

  • The first part is for initialization which includes graph partitioning and feature extraction.

  • The second part acts as the dynamic part where we dynamically update the summary statistics and structural similarities.

  • The last part corresponds to the testing set.

For the case of the DBLP data set, the papers collected between 1996 and 2003 are used for initialization, while we treat all the later publications as new incoming objects till 2008. The papers in 2009 are used to generate the test set. For both the Genetics and Biochemistry data sets, the initialization set includes all articles in the first 4 years (1996–1999), and then we continuously receive new publications from 2000 to 2004. As in previous cases, the publication collection of the last year is used for testing purposes.
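The three-way temporal split can be illustrated with a short sketch. The record layout (a dictionary with a "year" field) and the function name are hypothetical; the year boundaries in the example are the DBLP ones stated above.

```python
def temporal_split(papers, init_end, stream_end):
    """Split papers into initialization, dynamic (streamed) and test parts by year."""
    initial = [p for p in papers if p["year"] <= init_end]
    # The dynamic part is replayed in time order to simulate incoming objects.
    stream = sorted(
        (p for p in papers if init_end < p["year"] <= stream_end),
        key=lambda p: p["year"],
    )
    test = [p for p in papers if p["year"] > stream_end]
    return initial, stream, test


papers = [
    {"id": 4, "year": 2008},
    {"id": 1, "year": 1996},
    {"id": 5, "year": 2009},
    {"id": 3, "year": 2004},
    {"id": 2, "year": 2003},
]
# DBLP setting: 1996-2003 for initialization, 2004-2008 streamed, 2009 for testing.
initial, stream, test = temporal_split(papers, init_end=2003, stream_end=2008)
print([p["id"] for p in stream])
```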

As our proposed algorithm is expected to work well in both heterogeneous and homogeneous cases, we generate our input from both perspectives. In the graph derived from the DBLP data set, there are two types of nodes: author and conference. Accordingly, we generate two types of links: author–author links and author–conference links. For the other two data sets, we test them in the homogeneous scenario. In other words, we generated only author nodes and author–author links for them.

4.2.1. Feature description and parameter setting

Because the data sets used in our experiments are derived from co-authorship networks, we use the words in the paper titles as the attributes of each node. In general, an author who has published many papers has a longer list of attributes. The proposed algorithm has several parameters. In the graph partitioning step, we divided the graph into r = 100 partitions. The same number of 100 partitions is consistently maintained in the dynamic phase of the algorithm. The decay parameter λ in the calculation of the decay-weighted frequency is set to 1. We also tested several combinations of the balance parameters γ1, γ2, γ3 and γ4, and picked the combination {0.2,0.1,0.5,0.2}, as this is one of the settings that is effective across the different data sets. In PROPFLOW, we set the maximum distance to explore the network to 4. The setting of the CBSOCP algorithm is exactly the same as described in ref. 9. For the ADAMIC-ADAR algorithm [11], we again extract words from the paper titles as the feature vector of each node.

4.3. Accuracy Analysis

In order to quantify the effectiveness of our approach, we use precision and recall as evaluation measures, and compare our results with PROPFLOW [31], CBSOCP [9] and ADAMIC-ADAR. We used exactly the same testing methodology as discussed for CBSOCP. Since each data set has a large number of nodes and it is sometimes infeasible to test all combinations, the CBSOCP method considered all the links in the testing graph as positive examples and collected a sample of the negative links as negative examples. Precisely, half of the negative links were chosen for testing purposes according to the method discussed in ref. 9. In order to ensure a consistent comparison among the algorithms, we used the same set of positive and negative examples in all cases.

To evaluate the effectiveness of our algorithm, we calculate all test cases with the model from the training process, and rank them in descending order of the prediction scores. Note that in the case of the CBSOCP method, the ranking is based on the margin of the classifiers. It is natural to choose the top-k links of the list to be predicted as positive, and thus precision and recall metrics can be calculated by varying the value of k. Here, precision is defined as the percentage of true positive links that are predicted correctly among the top-k predictions and recall is defined as the percentage of true positive links that are predicted correctly out of the complete set of true positive links. Higher values of k lead to lower precision but higher recall.
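The two measures can be computed from a ranked list as in the following sketch. The function name and the toy scores are hypothetical, and the example assumes that k does not exceed the number of scored pairs.

```python
def precision_recall_at_k(scored_pairs, positives, k):
    """Precision and recall of the k highest-scoring candidate links.

    scored_pairs: list of (pair, score); positives: set of true test links.
    """
    ranked = sorted(scored_pairs, key=lambda item: item[1], reverse=True)
    top = [pair for pair, _ in ranked[:k]]
    hits = sum(1 for pair in top if pair in positives)
    return hits / k, hits / len(positives)


scored = [
    (("a", "b"), 0.9),
    (("a", "c"), 0.8),
    (("b", "c"), 0.4),
    (("c", "d"), 0.3),
    (("a", "d"), 0.1),
]
positives = {("a", "b"), ("b", "c"), ("a", "d")}

p1, r1 = precision_recall_at_k(scored, positives, k=1)
p4, r4 = precision_recall_at_k(scored, positives, k=4)
print(p1, r1, p4, r4)
```

In this toy example, moving from k = 1 to k = 4 lowers the precision from 1.0 to 0.5 while raising the recall from one third to two thirds, which is the trade-off visible in the precision and recall plots.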

Figures 2 and 3 depict the prediction precision for the two different types of links in the heterogeneous graph derived from the DBLP data set over different values of k. The value of k is illustrated on the X-axis, and the prediction precision is illustrated on the Y-axis. The value of k on the X-axis varies from 2000 to 10 000. Note that the training procedure of our DYNALINK algorithm is carried out as a single process for both author–author links and author–conference links. With the same model, we can predict different kinds of links even though the precisions are shown separately. In contrast, because the baseline algorithms PROPFLOW, CBSOCP and ADAMIC-ADAR can each work only on a particular type of link, we have to train them twice for the comparison, once for each link type. In spite of this, we can see that our dynamic link prediction scheme is significantly superior to the three baseline algorithms in terms of precision for both types of links. For example, when we set the value of k at 2000, the precision for our dynamic link prediction scheme is 73.8%, whereas that for the PROPFLOW method is only 4.8%, that for the CBSOCP method is 20.2%, and that for the ADAMIC-ADAR method is as low as 0.8%. The reason for the poor performance of ADAMIC-ADAR is that it is highly dependent on the attribute values, as compared with the other algorithms, which rely much more strongly on structure than on the attributes. As the attribute values were chosen to be keywords, they were typically not very discriminative for link prediction on a stand-alone basis. While the attributes are certainly helpful in performing a more effective link prediction, an algorithm which is overly dependent on the attribute values becomes ineffective. In the case of the DYNALINK method, the attribute values were used only to make the local decisions about the link prediction process. Thus, the real role of the attributes is to provide additional information for more effective link discovery. As expected, the precision generally drops off as k increases. However, the DYNALINK algorithm continues to maintain reasonably high precision even when the value of k increases.

Figure 2. Precision plot (dblp author–author links).

Figure 3. Precision plot (dblp author–conference links).

The recall values with increasing k for the DBLP data set are illustrated in Figs 4 and 5. As in the case of the precision plots, there are two recall plots for the heterogeneous DBLP data set, each of which corresponds to a different type of link. The DYNALINK method is superior to all the baseline algorithms in terms of recall, which means that our dynamic prediction approach retrieves more true positive links in the top-k predictions. This advantage is particularly pronounced for the DBLP data set, as shown in Figs 4 and 5. It is evident that the recall curve of the CBSOCP algorithm in these two figures is almost flat, and therefore there is no additional advantage in picking a larger value of k for increasing recall. The curve of PROPFLOW is even lower. The curve of ADAMIC-ADAR is the lowest, and the method can only correctly predict a few links in the testing graph. This is again because of its very direct dependence on the attributes, rather than using the content in a supporting role. In contrast, the corresponding recall curve of our method shows a rapidly increasing trend for larger values of k. For example, in the prediction of author–author links, the recall of the DYNALINK method is 14.5% when k is set to 2000, and it jumps to 49.8% when the first 10 000 predictions are chosen.

Figure 4. Recall plot (dblp author–author links).

Figure 5. Recall plot (dblp author–conference links).

The precision plot with increasing value of k for the Genetics data set is illustrated in Fig. 6. As in the previous case, the DYNALINK scheme achieves much higher precision than the PROPFLOW, CBSOCP and ADAMIC-ADAR algorithms. The gap among the four curves in the precision plot is more obvious when the value of k is relatively small. For example, for the top 2000 predictions, the precision of the DYNALINK method is 66.5%, while the PROPFLOW algorithm has a precision of only 17.8%, the CBSOCP algorithm has a precision of only 25.1%, and the precision of ADAMIC-ADAR is almost 0. Figure 7 shows the corresponding recall plot with increasing value of k for the Genetics data set. In this case, the recall curve shows an increasing trend for the DYNALINK, PROPFLOW and CBSOCP methods, while that of ADAMIC-ADAR still remains at a very low value, for the same reasons as discussed earlier. However, the DYNALINK scheme consistently reaches a much higher recall value, and retrieves more true positive links over the entire range of values of k.

Figure 6. Precision plot (Genetics).

Figure 7. Recall plot (Genetics).

The precision and recall plots of the Biochemistry data set are illustrated in Figs 8 and 9 respectively. As can be seen from the figures, the DYNALINK scheme is extremely robust in the sense that it outperforms all the baseline algorithms for every value of k in both precision and recall plots. In the recall plot, the ADAMIC-ADAR algorithm is able to recover very few positive links. Even though the recall of the PROPFLOW and CBSOCP algorithms increases with k, our DYNALINK scheme has a much faster increasing trend. At the lower end, when k is set to be 2000, the recall of the DYNALINK method outperforms the CBSOCP algorithm by a factor of about 1.1. On the other hand, at the higher end, when we use the top 10 000 predictions, the DYNALINK method outperforms the CBSOCP algorithm by a factor of 1.5.

Figure 8. Precision plot (Biochemistry).

Figure 9. Recall plot (Biochemistry).

4.3.1. Local statistics

To further demonstrate the advantage of considering local statistics when building our link inference model, we test our method with only global statistics involved, and compare it with the regular DYNALINK model. The same measures are used as introduced at the beginning of this section. The prediction precision for the two different types of links in the heterogeneous graph derived from the DBLP data set is illustrated in Figs 10 and 11, respectively. As we can see from the figures, DYNALINK consistently performs better than DYNALINK with global statistics only over different values of k. The advantage is especially significant when k is small. For example, when k = 2000, the precision of DYNALINK for predicting author–author links is 0.738. However, if we only consider global statistics, the precision drops to 0.603. The corresponding recall plots for the DBLP data set are shown in Figs 12 and 13. Unsurprisingly, DYNALINK achieves a higher recall regardless of the link type. This further supports the conclusion that including both local and global statistics in the model gives better performance. Figures 14–17 depict the corresponding results for the Genetics and Biochemistry data sets. A similar conclusion can be drawn from these figures: DYNALINK consistently outperforms DYNALINK(GLOBAL).

Figure 10. Precision plot (dblp author–author links).

Figure 11. Precision plot (dblp author–conference links).

Figure 12. Recall plot (dblp author–author links).

Figure 13. Recall plot (dblp author–conference links).

Figure 14. Precision plot (Genetics).

Figure 15. Recall plot (Genetics).

Figure 16. Precision plot (Biochemistry).

Figure 17. Recall plot (Biochemistry).

4.4. Efficiency Analysis

All experiments are performed on a Debian GNU/Linux server with two dual-core Xeon 3.0 GHz CPUs and 16GB main memory. The software was written in C++.

As in the case of the qualitative results, we used both the CBSOCP and ADAMIC-ADAR methods as the baseline approaches. The computational efficiency of the algorithms is illustrated in Table 2. Note that the running time shown in the table includes all parts of a complete training procedure. For the DYNALINK algorithm, a complete procedure involves initialization, attribute extraction and model statistics maintenance. On the other hand, the overall process of CBSOCP comprises feature calculation, clustering and model training, while the ADAMIC-ADAR algorithm mainly includes the calculation of feature similarities. We further note that while the DYNALINK algorithm can be maintained online in a dynamic way, this is not the case for either the CBSOCP algorithm or the ADAMIC-ADAR algorithm. As can be seen from the table, DYNALINK is faster than ADAMIC-ADAR on both the Genetics and Biochemistry data sets. Even though it is slower than ADAMIC-ADAR on the DBLP data set, both the precision and recall scores of DYNALINK are much higher, whereas those of ADAMIC-ADAR are very close to 0, as shown in the previous subsection. In addition, for all three testing data sets, the DYNALINK algorithm runs faster than the baseline CBSOCP. One major reason is that the features used in CBSOCP are more complicated and involve more calculation. In addition, CBSOCP requires the implementation of a maximum margin classifier. This is also one of the reasons that CBSOCP cannot be implemented as an online or real-time algorithm. On the other hand, the DYNALINK method is naturally designed to provide efficient and real-time link inference.

Table 2. Computational time.
Data set        DYNALINK (sec)    PROPFLOW (sec)    CBSOCP (sec)    ADAMIC-ADAR (sec)
DBLP            1078              13.4              4599            433
Genetics        785               9.3               12448           933
Biochemistry    1452              10.6              4026            1978

To further demonstrate that our proposed algorithm is highly efficient in terms of processing dynamic information networks, we also test the online model maintenance efficiency of the DYNALINK algorithm. For all three data sets, a new object is a newly published paper and inherently forms a small graph. Figures 18–20 depict the processing rate of our algorithm when the three data sets continuously receive new objects over time. The X-axis in the figures denotes the publication time of the corresponding objects for temporal identification. The processing rate is defined as the average number of new incoming edges that can be processed every second. Every time a new object arrives, the DYNALINK algorithm is expected to determine or reassign the cluster membership, update the node attributes and maintain the summary statistics as well as the structural similarity information. We also observe that the processing rate for the Genetics and Biochemistry data sets is higher than that for the DBLP data set. This is because each node in these data sets has fewer attributes than a node in the DBLP graph. In all cases, several thousand edges are processed each second, and therefore the proposed algorithm is very efficient, and can be effectively used for dynamic and online scenarios.

Figure 18. Efficiency on data stream (DBLP).

Figure 19. Efficiency on data stream (Genetics).

Figure 20. Efficiency on data stream (Biochemistry).

4.5. Scalability Analysis

In this section, we test the scalability of the proposed DYNALINK algorithm by varying the size of the input graph. To do so, we randomly sample edges from the original graphs and then measure the resulting accuracy and efficiency. Here, accuracy is measured as the number of correct predictions among the top 2000 predictions.
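The sampling and accuracy measurement described above can be sketched as follows. This is an illustration only: the function names are hypothetical, the sample is assumed to be uniform, and edges are assumed to be undirected when a prediction is matched against the ground truth, none of which is specified in the text.

```python
import random
from typing import Hashable, List, Sequence, Set, Tuple

Edge = Tuple[Hashable, Hashable]


def sample_edges(edges: Sequence[Edge], fraction: float, seed: int = 0) -> List[Edge]:
    """Return a uniform random sample holding `fraction` of the edges."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie in [0, 1]")
    rng = random.Random(seed)  # fixed seed so a run can be repeated
    k = int(round(fraction * len(edges)))
    return rng.sample(list(edges), k)


def correct_in_top_k(ranked: Sequence[Edge], true_edges: Set[Edge], k: int = 2000) -> int:
    """Count how many of the k highest-ranked predictions are true edges.

    Assumption: (u, v) and (v, u) denote the same undirected edge.
    """
    truth = {frozenset(edge) for edge in true_edges}
    return sum(1 for edge in ranked[:k] if frozenset(edge) in truth)
```

A scalability run would call `sample_edges` with increasing fractions, train on each sample, and report `correct_in_top_k` together with the measured running time.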

Figures 21 and 22 illustrate the scalability results obtained on the heterogeneous DBLP data set. The number of edges in the input graph is shown on the X-axis. As expected, the number of correct predictions is larger when the graph contains more edges for training. However, even when the number of edges is small, DYNALINK still achieves good accuracy. More importantly, DYNALINK is highly efficient and scales linearly with the data size. This is a very desirable property for large networks. The results of the scalability test on the Genetics and Biochemistry data sets are illustrated in Figures 23–26. As in the case of DBLP, the running time of DYNALINK increases linearly with the graph size.
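One simple way to check a claim of linear scaling from measured (number of edges, running time) pairs is to fit a least-squares line and inspect the coefficient of determination. The sketch below is not the procedure used in the experiments; the function name `linear_fit` is hypothetical, and no measurements from the paper are reproduced here.

```python
from typing import Sequence, Tuple


def linear_fit(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float, float]:
    """Least-squares line y = slope * x + intercept, returned with R^2."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    # A constant series is fitted exactly by a horizontal line.
    r_squared = 1.0 if ss_tot == 0 else 1.0 - ss_res / ss_tot
    return slope, intercept, r_squared
```

An R^2 value close to 1 on the measured pairs would be consistent with running time growing linearly in the number of edges.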

Figure 21. Scalability plot: accuracy (DBLP).

Figure 22. Scalability plot: efficiency (DBLP).

Figure 23. Scalability plot: accuracy (Genetics).

Figure 24. Scalability plot: efficiency (Genetics).

Figure 25. Scalability plot: accuracy (Biochemistry).

Figure 26. Scalability plot: efficiency (Biochemistry).

4.6. Community Prediction

We also studied the effectiveness of the community prediction algorithm through a case study of the communities found by the approach. Because the community prediction algorithm is built around predicting communities around local nodes, we present case studies and some efficiency results for the local community prediction process. Table 3 lists some examples of the locally predicted communities for the year 2009 together with their target nodes. We note that the locally predicted communities may include members from the past (since past communities are often repeated in the future), but they also contain some members who are not necessarily directly connected to the target node. One immediate observation is that the organizational influence on the communities is pervasive for all target nodes. This is natural, because two members of the same organization are more likely to belong to the same community. Furthermore, in a large majority of the cases, we found that the predicted nodes did eventually remain strongly connected with at least some members of the local community, even when they were not directly connected to the target node.

Table 3. Locally predicted communities around specific nodes.
Target node | Local community
Jiawei Han | Charu C. Aggarwal, ChengXiang Zhai, Venkatesh Ganti, Kaushik Chakrabarti, Yuguo Chen, Xiaohui Gu, Jian Pei, Wei Fan, Chen Chen, Hong Cheng, Dong Xin, Jing Jiang, Tianyi Wu, Xifeng Yan, Xiaolei Li, Qiaozhu Mei, Feida Zhu, Xiaoxin Yin, Deng Cai, Xiaofei He, Sangkyum Kim, Philip S. Yu, Xiaoyun Wu, Peixiang Zhao
Christos Faloutsos | Jon M. Kleinberg, Deepayan Chakrabarti, Masashi Yamamuro, Tamara G. Kolda, Huiming Qu, Jure Leskovec, Andrew Tomkins, Jimeng Sun, Agma J. M. Traina, Hanghang Tong, Mary McGlohon, Eric P. Xing, Fan Guo, Wenjie Fu
Kian-Lee Tan | Aoying Zhou, Gao Cong, Rong Zhang, Yongluan Zhou, Ying Yan, Zhenjie Zhang, Quang Hieu Vu, H. V. Jagadish, Anthony K. H. Tung, Beng Chin Ooi, Chee Yong Chan, Nan Wang, Panos Kalnis, Mario A. Nascimento
Nick Koudas | Dimitrios Gunopulos, Beng Chin Ooi, Vagelis Hristidis, Benjamin Arai, Fei Chiang, Divesh Srivastava, Nikos Sarkas, Suresh Venkatasubramanian, Bing Tian Dai, Nilesh Bansal, Sudipto Guha
Divesh Srivastava | H. V. Jagadish, Dimitrios Gunopulos, Beng Chin Ooi, Sudipto Guha, Anthony K. H. Tung, Nick Koudas, Suresh Venkatasubramanian, Bing Tian Dai
H. V. Jagadish | Divesh Srivastava, Aoying Zhou, Chee Yong Chan, Anthony K. H. Tung, Rong Zhang, Zhenjie Zhang, Quang Hieu Vu, Beng Chin Ooi, Nuwee Wiwatwattana, Kian-Lee Tan

We also tested the efficiency of local community prediction by determining the local communities around the 20 most productive authors in DBLP. For each of them, we predicted the local community around the node with different values of h, φ and Φ. The average running times for varying values of h, φ and Φ are reported in Figs 27, 28 and 30, respectively. These results show that our community prediction procedure is highly efficient: for a given node, the community around it can be predicted in as little as 0.1 second. More importantly, the computational time is extremely stable across the value combinations of the three tested parameters. In other words, the time required to obtain the predicted community remains more or less the same as h, φ or Φ changes.
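A parameter sweep of this kind can be organized as a small timing harness that varies one parameter while holding the others at their defaults. The sketch below is hypothetical: `predict` stands in for any local community prediction routine, and the keyword names `h`, `phi` (for φ) and `big_phi` (for Φ) are illustrative choices, not the interface of the authors' code.

```python
import time
from statistics import mean
from typing import Callable, Dict, Iterable, Sequence


def sweep_running_time(
    predict: Callable[..., object],
    nodes: Sequence[str],
    name: str,
    values: Iterable[float],
    defaults: Dict[str, float],
) -> Dict[float, float]:
    """Average wall-clock time per node for each value of one parameter."""
    if name not in defaults:
        raise KeyError(f"unknown parameter: {name}")
    results: Dict[float, float] = {}
    for value in values:
        # Override only the swept parameter; keep the rest at their defaults.
        params = dict(defaults, **{name: value})
        elapsed = []
        for node in nodes:
            start = time.perf_counter()
            predict(node, **params)
            elapsed.append(time.perf_counter() - start)
        results[value] = mean(elapsed)
    return results
```

With defaults such as `{"h": 3, "phi": 0.1, "big_phi": 0.1}`, calling the harness once per parameter name yields one running-time curve per parameter.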

Figure 27. Running time for different values of h.

Figure 28. Running time for different values of φ.

The running times and community sizes for different values of h are reported in Fig. 27 and Table 4, respectively. The other parameters are set to φ = 0.1 and Φ = 0.1. It is evident from Fig. 27 that the running times are not very sensitive to the value of h. This is essentially because the complexity of determining the various intermediate quantities remained almost the same even when the value of h changed. Furthermore, the community size did not vary much when the value of h was changed, as the results in Table 4 show.

Table 4. Average community size for different values of h.
Value of h | Average community size
2 | 8.444
3 | 9
4 | 9.0556
5 | 9.0556

We also tested the sensitivity of the running times and community sizes to the value of the parameter φ. The running times and community sizes for varying values of φ are reported in Fig. 28 and Table 5, respectively. The results of Table 5 are also plotted in Fig. 29. The other parameters were set to h = 3 and Φ = 0.1. As in the previous case, the computational times are not very sensitive to the value of φ. The community sizes decrease with increasing values of φ, because larger values of φ reduce the number of paths to other community members.

Figure 29. Average community sizes for different values of φ.

Table 5. Average community size for different values of φ.
Value of φ | Average community size
0.01 | 9.4444
0.05 | 9.0556
0.1 | 9
0.2 | 8.6667
0.4 | 5.5556

The running times and community sizes for different values of Φ are reported in Fig. 30 and Table 6, respectively. The corresponding variation in community size with Φ is also shown in Fig. 31. In each case, we set h = 3 and φ = 0.1. As in the previous cases, the running times were not very sensitive to the value of Φ. On the other hand, the community sizes decreased with increasing values of Φ. This is because an increase in the value of Φ reduced the connectivity, which correspondingly reduced the community size.

Figure 30. Running time for different values of Φ.

Figure 31. Average community size for different values of Φ.

Table 6. Average community size for different values of Φ.
Value of Φ | Average community size
0.01 | 9.9444
0.05 | 9.5556
0.1 | 9
0.2 | 7.4444
0.4 | 3.7778
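The sensitivity trends in Tables 4 to 6 can be summarized numerically. The sketch below transcribes the average community sizes from those tables and computes, for each parameter, the relative spread (max − min) / max and whether the size is non-increasing in the parameter. The relative spread is an illustrative summary chosen here, not a measure used in the paper, and the variable names are hypothetical.

```python
from typing import Dict


def relative_spread(sizes: Dict[float, float]) -> float:
    """(max - min) / max of the average community sizes."""
    values = list(sizes.values())
    return (max(values) - min(values)) / max(values)


def is_non_increasing(sizes: Dict[float, float]) -> bool:
    """True if the size never grows as the parameter value increases."""
    ordered = [sizes[key] for key in sorted(sizes)]
    return all(a >= b for a, b in zip(ordered, ordered[1:]))


# Average community sizes transcribed from Tables 4, 5 and 6.
table_h = {2: 8.444, 3: 9.0, 4: 9.0556, 5: 9.0556}
table_phi = {0.01: 9.4444, 0.05: 9.0556, 0.1: 9.0, 0.2: 8.6667, 0.4: 5.5556}
table_big_phi = {0.01: 9.9444, 0.05: 9.5556, 0.1: 9.0, 0.2: 7.4444, 0.4: 3.7778}

for label, table in [("h", table_h), ("phi", table_phi), ("Phi", table_big_phi)]:
    print(label, round(relative_spread(table), 3), is_non_increasing(table))
```

On these values the spread is under 7% for h, about 41% for φ and about 62% for Φ, which agrees with the observation that community size is insensitive to h but shrinks as φ or Φ grows.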

5. CONCLUSIONS AND SUMMARY

In this paper, we presented an algorithm for dynamic link inference in temporal and heterogeneous networks. The algorithm is designed to be extremely efficient and can construct link inference models for online, heterogeneous networks that evolve continuously over time. We achieve this goal by using a dynamic clustering approach in conjunction with content-based and structural models. Our experimental results show that the approach achieves superior accuracy because of this more sophisticated combination of clustering, content and structure. At the same time, our method is extremely efficient and can be made to work effectively on data streams. In addition to being an online algorithm, it is also much more efficient than state-of-the-art methods for link prediction.

6. ACKNOWLEDGMENTS

Research of the first author was sponsored by the Army Research Laboratory and was accomplished under Cooperative Agreement Number W911NF-09-2-0053. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation here on.

The work of the second and third authors was supported in part by NSF through grants IIS-0905215, CNS-1115234, IIS-0914934, DBI-0960443 and OISE-1129076, by the US Department of Army through grant W911NF-12-1-0066, and by a KAU grant.

REFERENCES

  1. C. Aggarwal, Y. Xie, and P. Yu, On dynamic link prediction in heterogeneous networks, In SDM Conference, 2012.
  2. J. Leskovec, Tutorial summary: Large social and information networks: opportunities for ML, In ICML, 2009, 179.
  3. S. F. Adafre and M. Rijke, Discovering missing links in Wikipedia, In LinkKDD, 2005, 90–97.
  4. C. Aggarwal, Social Network Data Analytics, Springer, New York, 2011.
  5. M. Al-Hassan, V. Chaoji, S. Salem, and M. J. Zaki, Link prediction using supervised learning, In SDM Workshop on Link Analysis, Counter-terrorism and Security, 2006.
  6. B. Taskar, M. F. Wong, P. Abbeel, and D. Koller, Link prediction in relational data, In NIPS, 2003.
  7. H. Kashima and N. Abe, A parameterized probabilistic model of network evolution for supervised link prediction, In ICDM, 2006, 340–349.
  8. D. Liben-Nowell and J. Kleinberg, The link prediction problem for social networks, In CIKM, 2003, 556–559.
  9. J. R. Doppa, J. Yu, P. Tadepalli, and L. Getoor, Chance constrained programs for link prediction, In NIPS Workshop on Analyzing Networks and Learning with Graphs, 2009.
  10. M.-S. Kim and J. Han, A particle-and-density based evolutionary clustering method for dynamic networks, PVLDB 2(1) (2009), 622–633.
  11. J. Kunegis and A. Lommatzsch, Learning spectral graph transformations for link prediction, In ICML, 2009, 561–568.
  12. J. Hopcroft, T. Lou, and J. Tang, Who will follow you back? Reciprocal relationship prediction, In CIKM Conference, 2011.
  13. J. Doppa, J. Yu, P. Tadepalli, and L. Getoor, Link mining: a survey, In SIGKDD Explorations, 2005, 3–12.
  14. L. Adamic and E. Adar, Friends and neighbors on the web, Soc Netw 25 (2001), 211–230.
  15. M. E. J. Newman, Clustering and preferential attachment in growing networks, Phys Rev Lett 64 (2001), 016131.
  16. M. Bilgic, G. Namata, and L. Getoor, Combining collective classification and link prediction, In ICDM Workshop on Mining Graphs and Complex Structures, 2007.
  17. L. Getoor, N. Friedman, D. Koller, and B. Taskar, Learning probabilistic models of relational structure, In ICML, 2001, 170–177.
  18. L. Getoor, N. Friedman, D. Koller, and B. Taskar, Learning probabilistic models of link structure, J Mach Learn Res 3 (2002), 679–707.
  19. O. Hassanzadeh, A. Kementsietsidis, L. Lim, R. J. Miller, and M. Wang, A framework for semantic link discovery over relational data, In CIKM, 2009, 1027–1036.
  20. K. Yu, W. Chu, S. Yu, V. Tresp, and Z. Xu, Stochastic relational models for discriminative link prediction, In NIPS, 2006, 1553–1560.
  21. J. Zhu, J. Hong, and G. Hughes, Using Markov models for web site link prediction, In ACM Hypertext & Hypermedia Conference, 2002, 169–170.
  22. L. Backstrom and J. Leskovec, Supervised random walks: predicting and recommending links in social networks, In WSDM, Hong Kong, 2011.
  23. Y. Sun, R. Barber, M. Gupta, C. Aggarwal, and J. Han, Co-author relationship prediction in heterogeneous bibliographic networks, In ASONAM, 2011.
  24. Y. Sun, J. Han, C. Aggarwal, and N. Chawla, When will it happen–relationship prediction in heterogeneous information networks, In WSDM, 2012.
  25. Y. Yang, N. Chawla, Y. Sun, and J. Han, Predicting links in multi-relational and heterogeneous networks, In ICDM Conference, 2012.
  26. D. Wang, D. Pedreschi, C. Song, F. Giannotti, and A.-L. Barabasi, Human mobility, social ties, and link prediction, In KDD, 2011.
  27. Y. Dong, J. Tang, S. Wu, J. Tian, J. Rao, and H. Cao, Link prediction and recommendation across heterogenous social networks, In ICDM Conference, 2012.
  28. G. Qi, C. Aggarwal, and T. Huang, Link prediction across networks by cross-network biased sampling, In ICDE Conference, 2013.
  29. J. Tang, T. Lou, and J. Kleinberg, Inferring social ties across heterogenous networks, In WSDM, 2012.
  30. D. Bortner and J. Han, Progressive clustering of networks using structure-connected order of traversal, In ICDE Conference, 2010, 653–656.
  31. D. Chakrabarti, R. Kumar, and A. Tomkins, Evolutionary clustering, In ACM KDD Conference, 2006.
  32. Y. Chi, X. Song, D. Zhou, K. Hino, and B. L. Tseng, Evolutionary spectral clustering by incorporating temporal smoothness, In KDD Conference, 2007, 153–162.
  33. A. Clauset, M. E. J. Newman, and C. Moore, Finding community structure in very large networks, Phys Rev E 70 (2004), 066111.
  34. R. Lichtenwalter, J. Lussier, and N. Chawla, New perspectives and methods in link prediction, In KDD, 2010.
  35. R. Lichtenwalter and N. Chawla, Lpmade: Link prediction made easy, J Mach Learn Res 12 (2011), 2489–2492.
  36. R. Lichtenwalter and N. Chawla, Vertex collocation profiles: subgraph counting for link analysis and prediction, In WWW Conference, 2012.
  37. R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins, Trawling the web for emerging cyber-communities, Comp Netw 31(11–16) (1999), 1481–1493.
  38. T. Yang, R. Jin, Y. Chi, and S. Zhu, Combining link and content for community detection: a discriminative approach, In KDD Conference, 2009, 927–936.
  39. Z. Zeng, J. Wang, L. Zhou, and G. Karypis, Out-of-core coherent closed quasi-clique mining from large dense graph databases, ACM Trans Database Syst 31(2) (2007), Article No 13.
  40. Y. Zhou, H. Cheng, and J. X. Yu, Graph clustering based on structural/attribute similarities, In Proc VLDB Endow 2(1) (2009), 718–729.
  41. C. Aggarwal and H. Wang, Managing and Mining Graph Data, Springer, 2010.

  • Footnote 1. This refers to the fact that the numerator and the denominator of the fraction may be 0.
  • Footnote 2. http://dblp.uni-trier.de/
  • Footnote 3. http://www.ncbi.nlm.nih.gov/entrez