## Introduction

Social network analysis (SNA) is concerned with describing and explaining social structure by means of network theory. During the last decades, many measures and techniques originally devised for SNA have been successfully applied in other research fields. More recently, they have also been introduced in information science (e.g., Björneborn, 2006; Kretschmer, 2004; Otte & Rousseau, 2002). Indeed, in the field of informetrics the interactions between documentary and/or social entities form an important study object (Wilson, 1999). They can be represented abstractly as networks of citations, collaborations, downloads etc.

The links in social and informetric networks do not appear randomly (see Newman (2003) for an overview of the differences between random and ‘real world’ networks). In the present paper, we explore factors that can influence link formation and evolution using an approach known as link prediction (LP). The problem that LP tries to tackle is this: given a snapshot of an evolving social network, how can one predict which new links will appear in some future snapshot of the same network? This and related questions have been studied by, among others, Huang (2006), Huang et al. (2005), Popescul & Ungar (2003), and Liben-Nowell & Kleinberg (2003, 2007). As in previous studies, we will only study the prediction of links between existing vertices; links to new vertices are outside the scope of this paper.

The potential of LP can be seen on both a theoretical and a practical level. On a theoretical level, LP may help to test and validate the myriad of network (evolution) models: given enough data, such models can be tested with LP. Practically, one can imagine many possible applications: LP can be used to recommend related items in digital libraries, or to suggest candidates for collaboration or relevant references in research. Further on in the paper, we outline an approach (referred to as ‘multi-input LP’) that could help university policy makers determine some of the factors that contribute to policy goals such as collaboration or internationalization.

Most research around LP consists of two broad steps.

- (i) Some predicting method is applied to a training network, which results in a prediction of possible new links. This method can be simple (e.g., implementing a proximity measure) or elaborate. Intuitively, it makes sense to assume that those vertices that are already close in some sense (e.g., they share many friends) will likely form a link at some later point.
- (ii) The predictive power of the method is evaluated by comparing the prediction to an actual later snapshot (the test network). This is represented graphically in Figure 1.
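
The two steps can be illustrated with a minimal sketch. It assumes the simplest proximity measure mentioned above, the number of shared neighbours, and evaluates the ranking with precision at *k*. The toy networks, the function names and the choice of evaluation measure are all hypothetical and are not taken from the case study in this paper.

```python
from itertools import combinations


def normalise(edges):
    """Return undirected edges as a set of sorted vertex pairs."""
    return {tuple(sorted(edge)) for edge in edges}


def common_neighbours_scores(edges):
    """Step (i): score every unlinked vertex pair of the training network
    by the number of neighbours the two vertices share."""
    neighbours = {}
    for u, v in normalise(edges):
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)
    scores = {}
    for u, v in combinations(sorted(neighbours), 2):
        if v not in neighbours[u]:  # only pairs that are not yet linked
            scores[(u, v)] = len(neighbours[u] & neighbours[v])
    return scores


def precision_at_k(scores, new_links, k):
    """Step (ii): fraction of the k highest-scored pairs that actually
    appear as new links in the test network (ties broken alphabetically)."""
    ranked = sorted(scores, key=lambda pair: (-scores[pair], pair))
    hits = sum(1 for pair in ranked[:k] if pair in new_links)
    return hits / k


# Hypothetical training and test snapshots over the same vertex set.
training_edges = [("a", "b"), ("a", "c"), ("b", "c"),
                  ("b", "d"), ("c", "d"), ("d", "e")]
test_edges = training_edges + [("a", "d"), ("c", "e")]

scores = common_neighbours_scores(training_edges)
new_links = normalise(test_edges) - normalise(training_edges)
print(precision_at_k(scores, new_links, k=2))
```

Only pairs of vertices that already exist in the training network receive a score, which matches the restriction to links between existing vertices.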

The training network *G* = (*V*, *E*) consists of a set of nodes or vertices *V* and a set of links or edges *E*. Link prediction can then be formally characterized as a function *LP*: *V* × *V* → **R**, that maps a pair of vertices to a real-valued likelihood score *w* ∈ [0,1]. The score *w* expresses the likelihood that a link between these two vertices exists in the predicted network. Note that the link prediction function only indirectly specifies a new, ‘predicted’ network, e.g. if combined with a threshold value for the likelihood score. The predicted network then consists of all links whose associated likelihood score exceeds the threshold value. Moreover, since the function only assigns a score to vertex pairs from the training network, (links to) new vertices cannot be predicted.
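
As a rough illustration of how a likelihood score and a threshold together yield a predicted network, the sketch below uses the Jaccard coefficient of the two neighbourhoods as the scoring function, because it naturally falls in [0,1]. The choice of Jaccard, the toy network and the names `jaccard_lp` and `predicted_links` are assumptions made for this example only.

```python
from itertools import combinations


def jaccard_lp(neighbours, u, v):
    """Likelihood score in [0, 1]: shared neighbours of u and v divided
    by the size of their combined neighbourhood."""
    union = neighbours[u] | neighbours[v]
    if not union:
        return 0.0
    return len(neighbours[u] & neighbours[v]) / len(union)


def predicted_links(neighbours, threshold):
    """All vertex pairs of the training network whose score exceeds
    the threshold; together they form the predicted network."""
    return {
        (u, v)
        for u, v in combinations(sorted(neighbours), 2)
        if jaccard_lp(neighbours, u, v) > threshold
    }


# Hypothetical training network, stored as vertex -> set of neighbours.
neighbours = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c", "e"},
    "e": {"d"},
}

print(predicted_links(neighbours, threshold=0.4))
```

Because the function is only defined on pairs drawn from *V*, no link to a vertex outside the training network can ever be produced, whatever threshold is chosen.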

LP as originally described is just one possible case in a larger ‘family’ of approaches. The next section discusses three ways in which LP can be generalized. These pave the way for so-called ‘multi-input LP’, which is based on more than one network. We explore the potential of both single-input and multi-input LP in a case study of collaboration. The last section contains the conclusions.